Sarah Piombo
Natalia Zemlianskaia
George G. Vega Yon
August 16th, 2021
What are R and RStudio?
Getting help with R.
A live example with ggplot2.
R is a language and environment for statistical computing and graphics. — https://r-project.org
RStudio is an integrated development environment (IDE) for R. — https://rstudio.org/products/rstudio
Let’s see a live view of RStudio!…
All the code for this section can be downloaded here. The entire presentation (which contains the code) was generated using RMarkdown and can be downloaded from here.
(you will learn more about RMarkdown in day 3!)
The line library(ggplot2) loaded the package ggplot2.
The line data("diamonds") loaded the dataset diamonds from the ggplot2 package.
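The two lines described above can be sketched as follows. This is a minimal reconstruction, assuming ggplot2 is already installed; the original code chunk is not shown in this text.

```r
# Attach the ggplot2 package (it must already be installed)
library(ggplot2)

# Load the diamonds dataset that ships with ggplot2
data("diamonds")
```

After these two lines run, an object called diamonds is available in the session.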
To get help with a function, we can use the help("<FUNCTION NAME>") command in R. For example, if we wanted to learn more about library(), we could just type help("library").
Or, equally valid, ?library.
(let’s check out what the help file looks like!)
What other arguments does the function data() accept?
What does the function str() do?
What does data look like in R? There are many ways to represent data in R. One of the most flexible (popular?) ways of doing so is through data frames (in the case of “base R”, the core component of R) and tibbles (in the case of the tidyverse). Tibbles and data frames share the same rectangular structure of rows and columns.
For example, here is how R prints a tibble and a data.frame:
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
And a data frame version of the same data:
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
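The chunk that produced the two printouts above is not shown here. One plausible sketch, assuming the first six rows were taken with head() and the conversion was done with as.data.frame(), is:

```r
library(ggplot2)
data("diamonds")

# diamonds ships as a tibble; head() keeps the first six rows
head(diamonds)

# Convert to a base R data frame and print the same six rows
head(as.data.frame(diamonds))
```

The data are identical in both cases; only the class of the object, and therefore the way R prints it, differs.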
R has functions to query how many rows and columns these objects have. We can use the nrow() and ncol() functions as follows:
## [1] 53940
## [1] 10
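The two numbers above are the row and column counts of diamonds. A minimal sketch of the calls involved, assuming diamonds has been loaded from ggplot2 (the dim() call is an addition, not part of the original slides):

```r
library(ggplot2)
data("diamonds")

nrow(diamonds)  # number of rows
ncol(diamonds)  # number of columns
dim(diamonds)   # both at once: rows first, then columns
```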
Now let’s get our hands dirty and do some visualization!
The ggplot2 R package is one of the most popular ways to build plots in R. Here we will look at a couple of examples using the diamonds dataset that we just loaded.
The overall structure of ggplot is as follows:
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
The ggplot() function sets up the data that we will be using.
<GEOM_FUNCTION>() tells ggplot what type of plot we are building (histogram, scatterplot, barplot, etc.).
aes(<MAPPINGS>) indicates how features (columns) of the data are to be included in the plot.
The + sign at the end of the line binds things together (we can add many layers/components to a single plot!).
Let’s see what happens if we run ggplot(data = diamonds) on its own.
Nothing! Because we haven’t told ggplot what we want to visualize. The function only knows that we would like to work with the diamonds dataset, but it has no idea of what to plot!
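The empty result described above can be reproduced with a sketch like the following, assuming the call was a bare ggplot() with only the data argument (the object name p is hypothetical):

```r
library(ggplot2)
data("diamonds")

# Only the data is supplied: no geom and no aesthetic mappings
p <- ggplot(data = diamonds)

# Printing the object draws an empty panel
p
```

Storing the plot in a variable is optional; it just makes it easy to add layers to the same object later with +.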
Let’s try again, this time adding geom_point() without any mappings:
Error: geom_point requires the following missing aesthetics: x and y
Run `rlang::last_error()` to see where the error occurred.
Oops! We got an error, and the error says "geom_point requires the following missing aesthetics: x and y", which means that we still need to give ggplot a bit more information about what we would like to visualize. Asking for a scatter plot without indicating which variables to plot is meaningless.
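A hedged sketch of how this error arises, assuming the failing call was geom_point() with no aes() mapping. The names p and err are hypothetical, and the exact wording of the message may differ between ggplot2 versions:

```r
library(ggplot2)
data("diamonds")

# A geom with no x/y mapping: creating the plot object does not fail ...
p <- ggplot(data = diamonds) + geom_point()

# ... the error only surfaces when the plot is built (which printing also does)
err <- tryCatch(ggplot_build(p), error = function(e) e)

# Show the error message that was captured
conditionMessage(err)
```

Wrapping the build step in tryCatch() is only there so the script keeps running; in an interactive session, simply printing p shows the same error.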
So let’s try again one more time and see what we get!
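The chunk for this step is not shown here. A minimal sketch, assuming the same carat and price mappings that the later examples use:

```r
library(ggplot2)
data("diamonds")

# Map carat to the x axis and price to the y axis
p <- ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))

# Draw the scatter plot
p
```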
How does the color affect the price?
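One way to explore this question, sketched under the assumption that the slide added the color column as a color aesthetic (as the later examples do):

```r
library(ggplot2)
data("diamonds")

# Same scatter plot, with each point colored by the diamond's color grade
p <- ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price, color = color))

p
```

Because color is mapped inside aes(), ggplot2 picks one color per grade and adds a legend automatically.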
Now, how about clarity of the diamond?
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price, color = color)) +
  facet_wrap(~clarity)
Finally, let’s add some titles to make it look nicer
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price, color = color)) +
  facet_wrap(~clarity) +
  labs(
    title = "Price of Diamonds (by clarity)",
    subtitle = "data from the ggplot2 R package",
    x = "Weight of the diamond (carat)",
    y = "Price in US dollars",
    color = "Color from \n J (worst) to D (best)"
  )
Beyond ggplot2 itself, there are dozens of other R packages that extend it! https://exts.ggplot2.tidyverse.org/gallery/
Reproduce the last plot, but this time put carat on the y axis and price on the x axis.
Using the "mpg" dataset (which can be loaded using data(mpg)), draw a similar plot using the following mappings aes(x = displ, y = hwy, color = drv). Fill in the missing pieces to get the plot:
data(< DATA >)
ggplot(data = < DATA >) +
  geom_point(mapping = < MAPPINGS >) +
  labs(
    title = "Fuel economy data",
    subtitle = "(1999 - 2008)",
    x = "Engine displacement (liters)",
    y = "Highway MPG",
    color = "Drive train"
  )
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = price, y = carat, color = color)) +
  facet_wrap(~clarity) +
  labs(
    title = "Price of Diamonds (by clarity)",
    subtitle = "data from the ggplot2 R package",
    y = "Weight of the diamond (carat)",
    x = "Price in US dollars",
    color = "Color from \n J (worst) to D (best)"
  )
data(mpg)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
  labs(
    title = "Fuel economy data",
    subtitle = "(1999 - 2008)",
    x = "Engine displacement (liters)",
    y = "Highway MPG",
    color = "Drive train"
  )
“R for data science” (free online book) https://r4ds.had.co.nz/
The R graph gallery https://www.r-graph-gallery.com/
The bookdown website (tons of free books about R) https://bookdown.org/
“R Markdown: The Definitive Guide” (free online book) https://bookdown.org/yihui/rmarkdown/
“RStudio Primers” (online interactive tutorials with R) https://rstudio.cloud/learn/primers
RStudio Webinars https://rstudio.com/resources/webinars/